Feature Manipulation with Genetic Programming
Author
Abstract
Feature manipulation refers to the process by which the input space of a machine learning task is altered in order to improve the learning quality and performance. Three major aspects of feature manipulation are feature construction, feature ranking, and feature selection. This thesis proposes a new filter-based methodology for feature manipulation in classification problems using genetic programming (GP). The goal is to modify the input representation of classification problems in order to improve classification performance and reduce the complexity of classification models. The thesis regards classification problems as a collection of variables including conditional variables (input features) and decision variables (target class labels). GP is used to discover the relationships between these variables. The types of relationship and the ways in which they are discovered vary with the three aspects of feature manipulation.

In feature construction, the thesis proposes a GP-based method to construct high-level features in the form of functions of the original input features. The functions are evolved by GP using an entropy-based fitness function that maximises the purity of class intervals. Unlike existing algorithms, the proposed GP-based method constructs multiple features and can effectively perform transformational dimensionality reduction, using only a small number of GP-constructed features while preserving good classification performance.

In feature ranking, the thesis proposes two GP-based methods for ranking single features and subsets of features. In single-feature ranking, the proposed method measures the influence of individual features on the classification performance by using GP to evolve a collection of weak classification models, and then measures the contribution of input features to the making of good models.
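To make the idea of an entropy-based fitness for a constructed feature concrete, here is a minimal Python sketch. It is not the fitness function defined in the thesis, whose exact form the abstract does not give. It assumes equal-width intervals and a class-entropy score weighted by interval size; the name `interval_entropy`, the parameter `n_bins`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def interval_entropy(values, labels, n_bins=4):
    """Weighted class entropy over equal-width intervals of one feature.

    Lower is purer: 0.0 means every interval contains a single class.
    Equal-width binning is an assumption made for this sketch only.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    bins = {}
    for v, y in zip(values, labels):
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum value
        bins.setdefault(idx, []).append(y)
    total = len(values)
    entropy = 0.0
    for ys in bins.values():
        counts = Counter(ys)
        h = -sum((c / len(ys)) * math.log2(c / len(ys)) for c in counts.values())
        entropy += (len(ys) / total) * h
    return entropy


# Toy data (hypothetical): the class depends on the sign of x1 - x2.
x1 = [1, 2, 4, 5, 3, 3]
x2 = [3, 5, 1, 2, 4, 1]
labels = ["a", "a", "b", "b", "a", "b"]

# A candidate constructed feature, such as a GP tree might encode.
constructed = [a - b for a, b in zip(x1, x2)]

raw_score = interval_entropy(x1, labels, n_bins=2)
constructed_score = interval_entropy(constructed, labels, n_bins=2)
```

On this toy data the constructed feature separates the classes into pure intervals while the original feature `x1` does not, so a search that minimises this score (or maximises its negative) would prefer the constructed feature.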
In ranking of subsets of features, a virtual structure for GP trees and a new binary relevance function are proposed to measure the relationship between a subset of features and the target class labels. It is observed that the proposed method can discover complex relationships, such as multi-modal class distributions and multivariate correlations, that cannot be detected by traditional methods.

In feature selection, the thesis provides a novel multi-objective GP-based approach to measuring the goodness of subsets of features. The subsets are evaluated based on their cardinality and their relationship to target class labels. The selection is performed by choosing a subset of features from a GP-discovered Pareto front containing sub-optimal solutions (subsets). The thesis also proposes a novel method for measuring the redundancy between input features. It is used to select a subset of relevant features that do not exhibit redundancy with respect to each other.

It is found that in all three aspects of feature manipulation, the proposed GP-based methodology is effective in discovering relationships between the features of a classification task. In the case of feature construction, the proposed GP-based methods evolve functions of conditional variables that can significantly improve the classification performance and reduce the complexity of the learned classifiers. In the case of feature ranking, the proposed GP-based methods can find complex relationships between conditional variables and decision variables. The resulting ranking shows a strong linear correlation with the actual classification performance. In the case of feature selection, the proposed GP-based method can find a set of sub-optimal subsets of features which provide a trade-off between the number of features and their relevance to the classification task. The proposed redundancy removal method can remove redundant features from a set of features.
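The trade-off between subset cardinality and relevance can be illustrated with a small Pareto-front filter in Python. This is only a sketch under stated assumptions: the relevance scores are made-up numbers standing in for whatever GP-based measure the thesis uses, and `pareto_front` and `dominates` are hypothetical helper names.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives
    (fewer or equal features, higher or equal relevance) and strictly
    better on at least one of them."""
    size_a, rel_a = len(a[0]), a[1]
    size_b, rel_b = len(b[0]), b[1]
    return (size_a <= size_b and rel_a >= rel_b
            and (size_a < size_b or rel_a > rel_b))


def pareto_front(candidates):
    """Keep the (subset, relevance) pairs that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical feature subsets with made-up relevance scores.
candidates = [
    (("f1",), 0.60),
    (("f2",), 0.55),
    (("f1", "f3"), 0.80),
    (("f2", "f3"), 0.70),
    (("f1", "f2", "f3"), 0.82),
    (("f1", "f2", "f4"), 0.78),
]

front = pareto_front(candidates)
front_subsets = [subset for subset, _ in front]
```

Each subset left on the front is the most relevant one found for its size, so a final choice comes down to how many features one is willing to keep.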
Both proposed feature selection methods can find an optimal subset of features that yields significantly better classification performance with a much smaller number of features than conventional classification methods.

Produced Publications

1. Kourosh Neshatian, Mengjie Zhang, and Mark Johnston. “Feature Construction and Dimension Reduction Using Genetic Programming”. Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI’07), Lecture Notes in Artificial Intelligence, Vol. 4830, Springer, Gold Coast, Australia, December 2007. pp. 160-170.
2. Kourosh Neshatian and Mengjie Zhang. “Genetic Programming and Class-Wise Orthogonal Transformation for Dimension Reduction in Classification Problems”. Proceedings of the 11th European Conference on Genetic Programming (EuroGP 2008), Lecture Notes in Computer Science, Vol. 4971, Springer, Napoli, Italy, March 2008. pp. 242-253.
3. Kourosh Neshatian and Mengjie Zhang. “Genetic Programming for Performance Improvement and Dimensionality Reduction of Classification Problems”. Proceedings of the 2008 IEEE World Congress on Computational Intelligence (CEC’08), IEEE Press, Hong Kong, June 2008. pp. 2811-2818.
4. Kourosh Neshatian, Mengjie Zhang, and Peter Andreae. “Genetic Programming for Feature Ranking in Classification Problems”. Proceedings of the Seventh International Conference on Simulated Evolution and Learning (SEAL’08), Lecture Notes in Computer Science, Vol. 5361, Springer, Melbourne, Australia, December 2008. pp. 544-554.
Related resources
A Fast and Self-Repairing Genetic Programming Designer for Logic Circuits
Usually, important parameters in the design and implementation of combinational logic circuits are the number of gates, transistors, and the levels used in the design of the circuit. In this regard, various evolutionary paradigms with different competency have recently been introduced. However, while being advantageous, evolutionary paradigms also have some limitations including: a) lack of con...
A Multi-objective Genetic Programming Biomarker Detection Approach in Mass Spectrometry Data
Mass spectrometry is currently the most commonly used technology in biochemical research for proteomic analysis. The main goal of proteomic profiling using mass spectrometry is the classification of samples from different clinical states. This requires the identification of proteins or peptides (biomarkers) that are expressed differentially between different clinical states. However, due to the...
Reasons for the Creation of the New Coronavirus 2019 (SARS-CoV2): Natural Mutation or Genetically Laboratory Manipulation-Point of View
Background and Objectives: Following the emergence of COVID-19 caused by the SARS-CoV2, the reasons for the emergence of the novel virus have been the subject of interest for molecular biology researchers and news agencies. This article attempted to emphasize all aspects of the emergence of this virus and discuss the latest information related to its development. Materials and Methods: From th...
A Parallel Genetic Algorithm Based Method for Feature Subset Selection in Intrusion Detection Systems
Intrusion detection systems are designed to provide security in computer networks, so that if the attacker crosses other security devices, they can detect and prevent the attack process. One of the most essential challenges in designing these systems is the so called curse of dimensionality. Therefore, in order to obtain satisfactory performance in these systems we have to take advantage of app...